This study aims to increase the processing speed of detecting non-rice objects 
based on the you only look once v3-tiny (YOLOv3-tiny) model. The system 
was developed on the Raspberry Pi 4 embedded system using the Movidius 
neural compute stick 2 (NCS 2). The data objects are pieces of gravel in a 
pile of rice recorded on video. The video data were obtained using a webcam 
with a resolution of 1920 x 1080 pixels, with a total of 2759 frames. The test 
results show that frames per second (FPS) increased by 1.27x with the 
Movidius NCS 2 compared to processing on the central processing unit (CPU) 
of the Raspberry Pi 4. Object detection on the video data completed in 
1871.408 seconds (1.474 FPS) on the Raspberry Pi 4 CPU and in 1477.141 
seconds (1.868 FPS) on the Movidius NCS 2. These results show that the 
Movidius NCS 2 increased the object detection processing speed in this study 
by 26.69% with the YOLOv3-tiny model on the Raspberry Pi 4 embedded system. 
1. INTRODUCTION 

Technology is increasingly being applied in agriculture [1], [2], and several approaches have been taken 
to make agriculture more efficient. Beyond smart-home systems [3], [4], the internet of things (IoT) has been 
used for regular, automatic plant monitoring, covering everything from soil fertility to plant health [5], [6]. 
As agricultural monitoring has advanced, more researchers have studied camera-based agricultural monitoring 
[7], [8], combined with IoT systems that compute on end devices [9], [10]. The camera acts as a sensor whose 
data are acquired and processed and then used as a basis for drawing conclusions, some of which are drawn 
using machine learning. 

Using cameras as input for machine learning is considered suitable for solving object detection 
problems in the monitoring process [11]–[14]. However, in remote monitoring, the latency of sending and 
processing data becomes a problem if detection is carried out on a server. To overcome this, a few studies 
have investigated object detection in edge computing. In edge computing, computation is carried out on the 
client/sensor side, so results are available sooner than when the data must first be sent to a server for 
processing. 

To explore hardware suited to edge computing, this study applies the Raspberry Pi to a camera-based 
object detection task. In addition to computing on the central processing unit (CPU) of the Raspberry Pi, 
the Movidius neural compute stick 2 (NCS 2) is also used. The performance of the Raspberry Pi CPU is then 
compared with that of the NCS 2, which serves to speed up computation on low-performance devices [15], 
for example when attached to a Raspberry Pi [16]–[18]. 


2. METHOD 
2.1. Research method 

This research focuses on applying object detection on the Raspberry Pi embedded system to detect 
gravel in a pile of rice. The model used for detection is you only look once (YOLO) v3-tiny (YOLOv3-tiny) 
with the Darknet-19 network. YOLOv3-tiny builds on YOLO [19], YOLO9000 [20], and YOLOv3 [21], which 
are based on convolutional neural networks (CNN) [22]. It is a reduced version of YOLOv3 with fewer layers, 
designed to increase computing speed [23]. The Darknet-19 configuration used in YOLOv3-tiny is shown in 
Table 1. The research consists of several stages, shown in Figure 1. 


Table 1. Darknet-19 architecture on the YOLOv3-tiny algorithm [23] 


Layer  Type           Filters  Size/Stride  Input        Output
0      Convolutional  16       3x3/1        416x416x3    416x416x16
1      Maxpool                 2x2/2        416x416x16   208x208x16
2      Convolutional  32       3x3/1        208x208x16   208x208x32
3      Maxpool                 2x2/2        208x208x32   104x104x32
4      Convolutional  64       3x3/1        104x104x32   104x104x64
5      Maxpool                 2x2/2        104x104x64   52x52x64
6      Convolutional  128      3x3/1        52x52x64     52x52x128
7      Maxpool                 2x2/2        52x52x128    26x26x128
8      Convolutional  256      3x3/1        26x26x128    26x26x256
9      Maxpool                 2x2/2        26x26x256    13x13x256
10     Convolutional  512      3x3/1        13x13x256    13x13x512
11     Maxpool                 2x2/1        13x13x512    13x13x512
12     Convolutional  1024     3x3/1        13x13x512    13x13x1024
13     Convolutional  256      1x1/1        13x13x1024   13x13x256
14     Convolutional  512      3x3/1        13x13x256    13x13x512
15     Convolutional  255      1x1/1        13x13x512    13x13x255
16     YOLO
17     Route 13
18     Convolutional  128      1x1/1        13x13x256    13x13x128
19     Up-sampling             2x2/2        13x13x128    26x26x128
20     Route 19, 8
21     Convolutional  256      3x3/1        26x26x384    26x26x256
22     Convolutional  255      1x1/1        26x26x256    26x26x255
23     YOLO
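The layer dimensions in Table 1 can be checked by propagating the 416 x 416 x 3 input through the layer list. The following is a minimal sketch, assuming same-padded convolutions and Darknet-style max pooling (a default total padding of size - 1, so the output size is (h - 1) // stride + 1). Route layers concatenate the channels of the layers they list, and YOLO layers pass their input through unchanged. The helper names are hypothetical.

```python
def conv_out(h, size, stride):
    # same padding: size // 2 pixels on each side
    return (h + 2 * (size // 2) - size) // stride + 1


def maxpool_out(h, size, stride):
    # Darknet pads max-pool inputs by (size - 1) in total by default
    return (h + (size - 1) - size) // stride + 1


# Layer list transcribed from Table 1: ("conv", filters, size, stride),
# ("maxpool", size, stride), ("upsample", factor), ("route", *layer_indices)
LAYERS = [
    ("conv", 16, 3, 1), ("maxpool", 2, 2),      # 0, 1
    ("conv", 32, 3, 1), ("maxpool", 2, 2),      # 2, 3
    ("conv", 64, 3, 1), ("maxpool", 2, 2),      # 4, 5
    ("conv", 128, 3, 1), ("maxpool", 2, 2),     # 6, 7
    ("conv", 256, 3, 1), ("maxpool", 2, 2),     # 8, 9
    ("conv", 512, 3, 1), ("maxpool", 2, 1),     # 10, 11
    ("conv", 1024, 3, 1), ("conv", 256, 1, 1),  # 12, 13
    ("conv", 512, 3, 1), ("conv", 255, 1, 1),   # 14, 15
    ("yolo",), ("route", 13),                   # 16, 17
    ("conv", 128, 1, 1), ("upsample", 2),       # 18, 19
    ("route", 19, 8), ("conv", 256, 3, 1),      # 20, 21
    ("conv", 255, 1, 1), ("yolo",),             # 22, 23
]


def propagate(layers, input_shape=(416, 416, 3)):
    """Return the (height, width, channels) output of every layer."""
    shapes = []
    h, w, c = input_shape
    for layer in layers:
        kind, args = layer[0], layer[1:]
        if kind == "conv":
            filters, size, stride = args
            h, w, c = conv_out(h, size, stride), conv_out(w, size, stride), filters
        elif kind == "maxpool":
            size, stride = args
            h, w = maxpool_out(h, size, stride), maxpool_out(w, size, stride)
        elif kind == "upsample":
            h, w = h * args[0], w * args[0]
        elif kind == "route":
            # concatenate along channels; the referenced layers share h and w
            refs = [shapes[i] for i in args]
            h, w = refs[0][0], refs[0][1]
            c = sum(ref[2] for ref in refs)
        # "yolo" layers leave the tensor shape unchanged
        shapes.append((h, w, c))
    return shapes


if __name__ == "__main__":
    for index, shape in enumerate(propagate(LAYERS)):
        print(index, LAYERS[index][0], shape)
```

The 255 filters before each YOLO layer correspond to 3 anchors x (5 + 80 classes) in the 80-class COCO configuration. If gravel were the only class, each of those layers would instead need 3 x (5 + 1) = 18 filters.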
Figure 1. Research stages 


First, the image data used for training and validating the model are annotated. The annotated images 
are then split into a training set and a validation set, after which model training begins. Finally, the trained 
model is tested with two computation methods, the CPU of the Raspberry Pi 4 model B and the Intel 
Movidius NCS 2, to compare the computational speed of the two methods. 
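The split step can be sketched as a reproducible shuffle followed by writing the two image lists that Darknet reads (one image path per line). The 80/20 ratio, the seed, and the function names are hypothetical, since the study does not report its split ratio.

```python
import random
from pathlib import Path


def split_dataset(image_paths, val_fraction=0.2, seed=42):
    """Shuffle annotated images and split them into (train, validation) lists.

    val_fraction and seed are illustrative values, not ones from the study.
    """
    paths = sorted(image_paths)           # deterministic starting order
    random.Random(seed).shuffle(paths)    # reproducible shuffle
    n_val = round(len(paths) * val_fraction)
    return paths[n_val:], paths[:n_val]


def write_list(paths, list_file):
    # Darknet's train/valid list files hold one image path per line
    Path(list_file).write_text("".join(f"{p}\n" for p in paths))


if __name__ == "__main__":
    # Hypothetical directory of annotated images
    train, val = split_dataset(Path("dataset/images").glob("*.jpg"))
    write_list(train, "train.txt")
    write_list(val, "valid.txt")
```

Fixing the seed keeps the same images in the validation set across runs, so validation results from different training runs remain comparable.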
To evaluate the detection model, this study uses the mean average precision (mAP) to support data 
analysis, because mAP is a popular measure of object detector accuracy [11]. The mAP is the mean of the 
per-class average precision (AP), where the AP of each class is determined from the precision-recall (PR) 
curve obtained by associating each detection with overlapping ground truths. It is given by 
$\mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N} AP_i$, where $N$ is the total number of classes. The YOLO loss 
function uses the sum-squared error to calculate the loss between the ground truth and the prediction, and it 
comprises localization loss, confidence loss, and classification loss [19]. The loss function is given by (1). 


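$$
\begin{aligned}
\mathrm{Loss} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left(C_i - \hat{C}_i\right)^2 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left(C_i - \hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
\tag{1}
$$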


where the image is divided into an $S \times S$ grid of cells, each of which predicts $B$ bounding boxes 
with coordinates $x, y, w, h$. $\mathbb{1}_{ij}^{\text{obj}}$ equals 1 if the $j$-th bounding box in cell $i$ is 
responsible for detecting the object and 0 otherwise, $\mathbb{1}_{ij}^{\text{noobj}}$ is its complement, and 
$\mathbb{1}_{i}^{\text{obj}}$ equals 1 if an object appears in cell $i$. $\lambda_{\text{coord}}$ increases the 
weight of the bounding-box coordinate loss, $C_i$ is the confidence score of box $j$ in cell $i$, 
$\lambda_{\text{noobj}}$ down-weights the loss for boxes that detect background, and $p_i(c)$ is the 
conditional class probability of class $c$ in cell $i$. The model was tested by detecting gravel in rice spread 
out on a table. For this test, the rice was loaded by hand and spread on a table above which a webcam had 
been mounted beforehand. 
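Equation (1) can be sketched in NumPy for a single image. This is a minimal illustration under stated assumptions: predictions and targets are already arranged per grid cell and per box in the hypothetical array layout described in the docstring, and the default weights $\lambda_{\text{coord}} = 5$ and $\lambda_{\text{noobj}} = 0.5$ are the values from the original YOLO paper [19], not values reported in this study. Darknet's YOLOv3-tiny implementation computes some terms differently (for example, it uses binary cross-entropy for class prediction), so the sketch illustrates (1) only.

```python
import numpy as np


def yolo_sse_loss(pred, truth, obj_mask, lambda_coord=5.0, lambda_noobj=0.5):
    """Sum-squared-error YOLO loss following equation (1).

    pred, truth: dicts of arrays (hypothetical layout)
        "xywh": (S*S, B, 4)  box centre x, y, width w, height h
        "conf": (S*S, B)     confidence C
        "cls":  (S*S, K)     conditional class probabilities p(c)
    obj_mask: bool (S*S, B), True where box j of cell i is responsible
        for an object, i.e. the indicator 1_ij^obj.
    """
    obj = obj_mask.astype(float)
    noobj = 1.0 - obj
    cell_obj = obj_mask.any(axis=1).astype(float)  # 1_i^obj

    px, py, pw, ph = np.moveaxis(pred["xywh"], -1, 0)
    tx, ty, tw, th = np.moveaxis(truth["xywh"], -1, 0)

    # localization terms (square roots damp errors on large boxes)
    xy = np.sum(obj * ((px - tx) ** 2 + (py - ty) ** 2))
    wh = np.sum(obj * ((np.sqrt(pw) - np.sqrt(tw)) ** 2
                       + (np.sqrt(ph) - np.sqrt(th)) ** 2))

    # confidence terms, split by whether the box is responsible for an object
    conf_sq = (pred["conf"] - truth["conf"]) ** 2
    conf_obj = np.sum(obj * conf_sq)
    conf_noobj = np.sum(noobj * conf_sq)

    # classification term, counted only for cells that contain an object
    cls = np.sum(cell_obj[:, None] * (pred["cls"] - truth["cls"]) ** 2)

    return lambda_coord * (xy + wh) + conf_obj + lambda_noobj * conf_noobj + cls
```

The $\lambda_{\text{noobj}}$ term matters because most boxes in a frame cover background; without down-weighting, their confidence errors would dominate the gradient.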


2.2. Hardware and software 

Both hardware and software are needed to test the gravel detection process. The YOLOv3-tiny model 
was trained on a personal computer (PC) with an Intel Core i5-8400 2.8 GHz processor, 8 gigabytes of 
random access memory (RAM), and an NVIDIA GeForce GTX 1070 Ti graphics processing unit (GPU). 
The deployment hardware is a Raspberry Pi 4 model B with 4 gigabytes of RAM and a Broadcom BCM2711 
processor, a quad-core Cortex-A72 (ARM v8) 64-bit system-on-a-chip (SoC) running at 1.5 GHz. For the 
processing-speed comparison, the hardware is additionally fitted with a Movidius NCS 2, which contains an 
Intel Movidius Myriad X vision processing unit (VPU) with 16 SHAVE cores (128-bit very long instruction 
word (VLIW) vector processors). Besides the embedded system and processing components, a Logitech 
C525 webcam is used in this study to record the video data to be processed, at 720p resolution. The NCS 2 is 
used in this study because neural compute sticks have so far been fairly successful at handling CNN-based 
object detection [24], [25]. 

Software is also configured to support the hardware. The Raspberry Pi 4 model B runs RaspiOS 
Bullseye, which is based on Debian 11. RaspiOS is equipped with the Python programming language and the 
OpenCV and OpenVINO frameworks, which are used to test the computing speed of the CPU and the NCS 2. 
Meanwhile, model training on the PC is supported by CUDA 9.0 and cuDNN 7.5, together with LabelImg 
for labeling data and the Darknet framework for running the YOLOv3-tiny training process. 
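LabelImg can save annotations in the YOLO text format that Darknet reads: one line per object containing the class index followed by the box centre x, centre y, width, and height, each normalised by the image size. The conversion from a pixel-space box is sketched below; the function name and the class index 0 for gravel are hypothetical.

```python
def to_yolo_label(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space box into one YOLO/Darknet label line.

    Output: "class cx cy w h", with the four box values normalised to [0, 1].
    """
    cx = (x_min + x_max) / 2 / img_w
    cy = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


if __name__ == "__main__":
    # A box covering the top-left quarter of a 1920 x 1080 frame
    print(to_yolo_label(0, 0, 0, 960, 540, 1920, 1080))
    # 0 0.250000 0.250000 0.500000 0.500000
```

Because the values are normalised, the same label file remains valid when Darknet resizes images to the 416 x 416 network input.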


3. RESULTS AND DISCUSSION 

The Raspberry Pi 4 model B hardware was tested both with and without the Intel Movidius NCS 2 to 
support two test methods. The first test runs the object detection process on the CPU of the Raspberry Pi 4 
model B. The second test runs the object detection process on the Raspberry Pi 4 model B with the Intel 
Movidius NCS 2. 
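The two test methods can be sketched as one timing harness that runs the same detector over every video frame and reports the elapsed time and FPS. This is a minimal sketch, assuming OpenCV's DNN module loads the Darknet files directly; the paper does not state whether inference went through OpenCV's DNN module or OpenVINO's own runtime. The "ncs2" path additionally assumes an OpenCV build with OpenVINO Inference Engine support, and the file names are hypothetical.

```python
import time


def benchmark(frames, detect):
    """Run `detect` on every frame; return (elapsed_seconds, frames_per_second)."""
    count = 0
    start = time.perf_counter()
    for frame in frames:
        detect(frame)
        count += 1
    elapsed = time.perf_counter() - start
    return elapsed, (count / elapsed if elapsed > 0 else 0.0)


def load_detector(cfg_path, weights_path, device="cpu"):
    """Build a detect(frame) callable with OpenCV's DNN module.

    device="cpu" runs on the Raspberry Pi CPU; device="ncs2" targets the
    Myriad X VPU and assumes OpenCV was built with OpenVINO support.
    """
    import cv2  # imported here so benchmark() is usable without OpenCV

    net = cv2.dnn.readNetFromDarknet(cfg_path, weights_path)
    if device == "ncs2":
        net.setPreferableBackend(cv2.dnn.DNN_BACKEND_INFERENCE_ENGINE)
        net.setPreferableTarget(cv2.dnn.DNN_TARGET_MYRIAD)
    else:
        net.setPreferableBackend(cv2.dnn.DNN_BACKEND_OPENCV)
        net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)
    out_names = net.getUnconnectedOutLayersNames()

    def detect(frame):
        # 416 x 416 input as in Table 1, pixel values scaled to [0, 1]
        blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
        net.setInput(blob)
        return net.forward(out_names)

    return detect


def video_frames(path):
    """Yield frames from a video file until it is exhausted."""
    import cv2

    cap = cv2.VideoCapture(path)
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            yield frame
    finally:
        cap.release()


if __name__ == "__main__":
    # Hypothetical file names, for illustration only
    for device in ("cpu", "ncs2"):
        detect = load_detector("yolov3-tiny-gravel.cfg", "yolov3-tiny-gravel.weights", device)
        seconds, fps = benchmark(video_frames("rice_gravel.mp4"), detect)
        print(f"{device}: {seconds:.3f} s, {fps:.3f} FPS")
```

Because frames are decoded inside the timed loop, video decoding time is included in the measurement. This is one reasonable reading of "processing time" and may differ from how the reported figures were measured.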


3.1. The results of training and model validation 

The following YOLOv3-tiny hyperparameters worked well for training: decay 0.0005, momentum 0.9, 
saturation 1.5, exposure 1.5, and hue 0.1. The learning rate was set to 0.001, the model weights were saved 
every 1000 iterations, and training ran for a maximum of 15000 iterations. Training and validation yield loss 
values that reflect how well the model detects objects in images: the smaller the loss and the larger the mAP, 
the better the model detects objects in new image data. The training process is shown in Figure 2, in which 
the loss decreased from 2.14 at the 1000th iteration to 0.27 at the 15000th iteration. 
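The reported hyperparameters correspond to keys of the `[net]` section of a Darknet `.cfg` file. A minimal sketch that writes them out is shown below, assuming the 416 x 416 x 3 input of Table 1. The batch size, subdivisions, and learning-rate schedule are not reported and are therefore left out; the function name is hypothetical.

```python
HYPERPARAMS = {
    "width": 416,           # input size from Table 1
    "height": 416,
    "channels": 3,
    "momentum": 0.9,
    "decay": 0.0005,
    "saturation": 1.5,      # data-augmentation ranges
    "exposure": 1.5,
    "hue": 0.1,
    "learning_rate": 0.001,
    "max_batches": 15000,   # Darknet counts one iteration per batch
}


def net_section(params):
    """Render hyperparameters as the [net] block of a Darknet .cfg file."""
    return "[net]\n" + "".join(f"{key}={value}\n" for key, value in params.items())


if __name__ == "__main__":
    print(net_section(HYPERPARAMS), end="")
```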
Figure 2. The results of the model training process 
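The mAP criterion defined in Section 2.1 can be sketched per class as follows. The sketch assumes detections have already been matched to ground truth and flagged as true or false positives (for example with an IoU threshold of 0.5, a common choice that is not stated in this study). It also uses all-point interpolation of the precision-recall curve, which is one of several AP conventions. The function names are hypothetical.

```python
import numpy as np


def average_precision(scores, is_tp, n_ground_truth):
    """AP for one class from detection scores and true/false-positive flags.

    Uses all-point interpolation; n_ground_truth must be positive.
    """
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp
    tp_cum, fp_cum = np.cumsum(tp), np.cumsum(fp)
    recall = tp_cum / n_ground_truth
    precision = tp_cum / (tp_cum + fp_cum)

    # add sentinel points, then make precision non-increasing in recall
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]

    # area under the stepwise PR curve, summed where recall changes
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))


def mean_average_precision(ap_per_class):
    """mAP = (1/N) * sum of the per-class AP values."""
    return float(np.mean(list(ap_per_class)))


if __name__ == "__main__":
    # Three detections of one class, the second a false positive, 2 ground truths
    print(average_precision([0.9, 0.8, 0.7], [1, 0, 1], 2))
```

For the example, precision is 1 up to recall 0.5 and 2/3 up to recall 1.0, so AP = 0.5 x 1 + 0.5 x 2/3 = 5/6.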


3.2. Model test results 

The trained model was tested on 2759 video frames with a resolution of 1920 x 1080 px. Using the 
Raspberry Pi 4 model B CPU, the object detection processing speed was 1.474 frames per second (FPS), 
whereas using the Raspberry Pi 4 model B with the Intel Movidius NCS 2 it was 1.868 FPS. The test results 
are plotted in Figure 3, where each frame is shown as an orange dot for Movidius NCS 2 processing and a 
blue dot for Raspberry Pi 4 model B CPU processing. 
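The FPS values and the speed-up follow directly from the frame count and the total processing times reported in the abstract and conclusion (1871.408 s on the CPU and 1477.141 s on the NCS 2). The arithmetic is:

```python
FRAMES = 2759            # frames in the test video
SECONDS_CPU = 1871.408   # Raspberry Pi 4 CPU, total processing time
SECONDS_NCS2 = 1477.141  # Movidius NCS 2, total processing time

cpu_fps = FRAMES / SECONDS_CPU
ncs2_fps = FRAMES / SECONDS_NCS2
speedup = ncs2_fps / cpu_fps  # same as SECONDS_CPU / SECONDS_NCS2

print(f"CPU: {cpu_fps:.3f} FPS, NCS 2: {ncs2_fps:.3f} FPS")
# CPU: 1.474 FPS, NCS 2: 1.868 FPS
print(f"speed-up: {speedup:.2f}x, i.e. {(speedup - 1) * 100:.2f}% more frames per second")
# speed-up: 1.27x, i.e. 26.69% more frames per second
```

Because both runs process the same 2759 frames, the FPS ratio equals the inverse ratio of the processing times, which is where the 1.27x and 26.69% figures come from.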
Figure 3. The comparison of object detection processing speed 


3.3. System test results 

After training was complete, the model was tested on a Python and OpenCV-based detection and 
classification system using image data that the model had not seen during training. The test used a video 
containing 2759 frames with a resolution of 1920 x 1080 px. Frames taken from the output video are shown in 
Figure 4, with 3 objects in Figure 4(a), 4 objects in Figure 4(b), 5 objects in Figure 4(c), 6 objects in 
Figure 4(d), 7 objects in Figure 4(e), and 8 objects in Figure 4(f). 
Figure 4. Test results using video data on (a) 3 objects, (b) 4 objects, (c) 5 objects, (d) 6 objects, (e) 7 objects, 
and (f) 8 objects 
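The object counts shown in Figure 4 correspond to the boxes that survive confidence filtering and non-maximum suppression (NMS). A NumPy sketch of that post-processing step is given below. It assumes the row layout commonly used with OpenCV's DNN module for Darknet YOLO models (normalised centre x, centre y, width, height, objectness, then one score per class). The thresholds 0.5 and 0.4 and the function names are hypothetical.

```python
import numpy as np


def box_iou(box, boxes):
    """IoU between one (x1, y1, x2, y2) box and an (N, 4) array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def nms(boxes, scores, iou_threshold):
    """Greedy non-maximum suppression; returns kept indices, best score first."""
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # drop remaining boxes that overlap the kept box too much
        order = rest[box_iou(boxes[best], boxes[rest]) <= iou_threshold]
    return np.array(keep, dtype=int)


def detections_from_rows(rows, frame_w, frame_h, conf_threshold=0.5, iou_threshold=0.4):
    """Turn YOLO output rows into pixel boxes and scores after filtering and NMS."""
    rows = np.asarray(rows, dtype=float)
    scores = rows[:, 5:].max(axis=1)          # best class score per row
    mask = scores > conf_threshold
    rows, scores = rows[mask], scores[mask]
    cx, cy = rows[:, 0] * frame_w, rows[:, 1] * frame_h
    w, h = rows[:, 2] * frame_w, rows[:, 3] * frame_h
    boxes = np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)
    keep = nms(boxes, scores, iou_threshold)
    return boxes[keep], scores[keep]
```

Given the per-layer output arrays of a YOLO network stacked with `np.vstack`, `len(detections_from_rows(rows, 1920, 1080)[0])` would be the number of objects counted in a 1920 x 1080 frame.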


4. CONCLUSION 

The test results show that the object detection processing speed increased by 1.27x with the Intel 
Movidius NCS 2 compared to processing on the CPU of the Raspberry Pi 4. Object detection on the video 
data completed in 1871.408 seconds (1.474 FPS) on the CPU of the Raspberry Pi 4 model B and in 1477.141 
seconds (1.868 FPS) on the Movidius NCS 2. These results show that the Intel Movidius NCS 2 increased 
the object detection processing speed in this study by 26.69% with the YOLOv3-tiny model on the Raspberry 
Pi 4 model B embedded system. In this study, the NCS 2 was successfully used to improve the FPS of an 
object detection task, and it could potentially be combined with an end-effector mechanism to pick out the 
detected non-rice objects in further research. Furthermore, parallel processing with more than one NCS, as 
well as modified algorithms, could be used to increase the FPS for real-world applications. 